Ways to Improve N-gram Language Models for OCR and Speech Recognition of Slavic Languages
Abstract
This paper investigates the problems of n-gram language models in OCR and speech recognition for Slavic languages and proposes methods applicable to most of them. Two approaches are tested: filtering the n-gram model and alternative ways of smoothing. The filtering relies on heuristics based on the frequencies and morphological features of words. The smoothing uses classes based on morphological features in combination with a new discounting formula, and it can also be combined with inner filtering. Numerical experiments for the Ukrainian language show that both approaches produce interesting results. Smoothing is the more promising of the two, but it is also more complex: outperforming standard smoothing techniques will require further work on developing suitable classes from morphological information.
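To make the two ideas concrete, here is a minimal sketch under stated assumptions. The abstract does not give the paper's heuristics or its new discounting formula, so the sketch substitutes two standard techniques: a plain count cutoff for filtering and interpolated absolute discounting for smoothing. All function names, the `min_count` threshold, and the discount `d=0.75` are illustrative assumptions, and no morphological classes are used.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


def filter_by_frequency(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times.

    Assumption: a plain count cutoff stands in for the paper's
    frequency- and morphology-based heuristics, which are not specified.
    """
    return Counter({ng: c for ng, c in counts.items() if c >= min_count})


def absolute_discount_prob(w_prev, w, bigrams, unigrams, d=0.75):
    """Estimate P(w | w_prev) with interpolated absolute discounting.

    Assumption: this is the textbook formula, not the paper's new one.
    """
    context_total = sum(c for (a, _), c in bigrams.items() if a == w_prev)
    p_unigram = unigrams[w] / sum(unigrams.values())
    if context_total == 0:
        # Unseen context: fall back to the unigram estimate.
        return p_unigram
    distinct_followers = sum(1 for (a, _) in bigrams if a == w_prev)
    discounted = max(bigrams[(w_prev, w)] - d, 0) / context_total
    # Mass removed by discounting is redistributed via the unigram model.
    backoff_weight = d * distinct_followers / context_total
    return discounted + backoff_weight * p_unigram


tokens = "a b a b a c".split()
unigrams = Counter(tokens)
bigrams = bigram_counts(tokens)
filtered = filter_by_frequency(bigrams, min_count=2)
```

With this toy corpus, `filtered` keeps only the bigrams seen at least twice, and the smoothed probabilities for a given context sum to one over the vocabulary.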
Similar resources
Cache-Augmented Latent Topic Language Models for Speech Retrieval
We aim to improve speech retrieval performance by augmenting traditional N-gram language models with different types of topic context. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze our proposed model both for its ability to capture word repetition via the cache and for its sui...
Comparison of Spectral Methods for Spoken Language Identification
Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. A statistical model of these features is then obtained for each language. The ...
Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages
We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...
Speech Recognition on English-Mandarin Code-Switching Data using Factored Language Models - with Part-of-Speech Tags, Language ID and Code-Switch Point Probability as Factors
Code-switching (CS) is defined as "the alternate use of two or more languages in the same utterance or conversation" [1]. CS is a widespread phenomenon in multilingual communities, where multiple languages are concurrently used in a conversation. For automatic speech recognition (ASR), intra-sentential code-switching in particular poses an interesting challenge due to the multilingual context for la...
Jezikovno neodvisno modeliranje pregibnega jezika
This article concerns statistical language modelling of the Slovenian language for automatic speech recognition. We investigate various techniques for overcoming the difficulties in modelling highly inflected languages. Slavic languages, Slovenian among them, are particularly challenging. Two main problems arise when modelling Slovenian in comparison to English. Th...
Publication date: 2014